Introduction

In this handout we will use the data analytic workflow/FACT framework to make predictions about future sales. Our focus will be on data modeling. The steps, and the details we will focus on, are listed next.

Steps

  1. Frame a question: What are future sales predicted to be for a campaign?
  2. Assemble data: The data has already been gathered and cleaned.
  3. Calculate results
  • First using EDA: The missing values have already been removed and the data cleaned up quite a bit; however, we will use EDA to understand univariate and bivariate relationships and to reinforce the regression calculations.
  • Second using regression: We will illustrate both backward elimination and forward selection for creating models that we can use for prediction.
  4. Tell others about the results: We will use the models to make predictions, and then plot the predictions to help us identify the optimal allocation of advertising dollars.

Load libraries and data

Let’s first load the packages that we will use, read in the data, and look at its structure.

library(tidyverse) # For lots of tidyverse packages and functions
library(magrittr) # For pipe symbol
library(GGally) # For nice looking pairplots
library(plotly) # For interactive plots like in Alteryx

# Read in data
df <- readRDS('adSpend.rds') # Fictitious data for illustrating regression concepts
str(df) # Structure
## 'data.frame':    426 obs. of  5 variables:
##  $ sales           : num  1744 2193 962 1988 2387 ...
##  $ adSpendFacebook : num  50 82 13 63 77 70 31 53 63 11 ...
##  $ adSpendInstagram: int  2 76 62 65 61 98 76 4 87 32 ...
##  $ campaign        : chr  "C1" "C1" "C1" "C1" ...
##  $ store           : chr  "A" "A" "A" "A" ...

The adSpend data is fictitious data that we will use to illustrate regression concepts. It consists of 426 observations and 5 variables. Here is a brief description of each column:

  1. sales: The dollar value of sales for the duration of the associated ad campaigns.
  2. adSpendFacebook: The dollar value spent on advertising on Facebook.
  3. adSpendInstagram: The dollar value spent on advertising on Instagram.
  4. campaign: The name of the advertising campaign. There are three different campaigns in this dataset.
  5. store: The nominal name of the store.

Check for missing values

Missing values:

colSums(is.na(df))
##            sales  adSpendFacebook adSpendInstagram         campaign 
##                0                0                0                0 
##            store 
##                0

There are no missing values.
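
Had there been any missing values, we would need to handle them before modeling. A minimal sketch of one common approach, dropping the incomplete rows with tidyr's drop_na() (loaded with the tidyverse):

```r
# Drop any rows that contain missing values
# (a no-op for this dataset, since there are none)
df <- df %>% drop_na()
```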

Cardinality of Campaign and Store

Let’s get an idea of how many unique values there are for each campaign and store.

summary(factor(df$campaign))
##  C1  C2  C3 
## 142 142 142
summary(factor(df$store))
##   A   B   C   D   E 
##  79  82  72  88 105
table(df$campaign, df$store)
##     
##       A  B  C  D  E
##   C1 23 24 25 30 40
##   C2 18 28 27 37 32
##   C3 38 30 20 21 33
  • There is an equal number of observations for each campaign: 142.
  • There is an uneven number of observations across the stores.
    • Store E has the most observations: 105.
    • Store C has the fewest observations: 72.
  • The table function breaks this down for each campaign/store combination.
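
When the raw counts are hard to compare, base R's prop.table() converts the table to proportions. For example, row proportions show how each campaign's observations are split across the stores:

```r
# Row proportions: each campaign's split across the five stores
# (margin = 1 makes each row sum to 1)
round(prop.table(table(df$campaign, df$store), margin = 1), 2)
```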

EDA (Overall)

Let’s look at a visual representation of the data using a pairplot.

The code cell below creates a really useful set of plots for a dataset as small as this one. We are only using the first four columns of data: sales, adSpendFacebook, adSpendInstagram, and campaign. The store column will not be included in this plot, but we could re-run the code with color set to store.

# The ggpairs function is from the GGally package. It creates a rich pairplot
ggpairs(df
        , columns = 1:4 # Only include the first four columns, and leave out the last column, store.
        , aes(color = campaign # Let's look for any differences among campaigns. 
              , alpha = .4 # Makes the colors partially transparent
              )
        )

Organization of the pairplot

  • The diagonal shows the univariate distribution for each variable.
    • For a numeric variable, a density plot is used, separated by campaign.
    • For a categorical variable, a barplot is used to show the count of observations.
  • The lower triangle shows bivariate relationships.
    • For two numeric variables, a scatter plot is used.
    • For a numeric and a categorical variable, a faceted histogram is used.
  • The upper triangle quantifies bivariate relationships.
    • For two numeric variables, a correlation is used.
    • For a numeric and a categorical variable, a faceted boxplot is used.

Takeaways from the pairplot

  1. Sales for campaign 3 are generally lower, as indicated in the density plots and box-and-whisker plots for sales in the first row of the pairplot.
  2. Sales has a strong correlation with adSpendFacebook overall (0.6). However, sales has a very high correlation with adSpendFacebook for campaign 1 (0.93), and a weak one for campaign 2 (0.3). You can make out these correlations in the scatter plot, but with so many colors, they are hard to verify visually.
  3. Sales has a moderate correlation with adSpendInstagram overall (0.47). Sales has a strong correlation with adSpendInstagram for campaign 2 (0.64) and campaign 3 (0.68), and a weak one for campaign 1 (0.14). Again, these are hard to verify in the scatter plot with so many data points.
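
Since the per-campaign correlations are hard to read off the pairplot, a quick way to confirm them is to compute them directly. A sketch using dplyr (loaded with the tidyverse):

```r
# Correlation between sales and each ad-spend variable, by campaign
df %>%
  group_by(campaign) %>%
  summarize(corFacebook  = cor(sales, adSpendFacebook)
            , corInstagram = cor(sales, adSpendInstagram))
```

The values should match those shown in the upper triangle of the pairplot.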

Campaign 1 (Backward Elimination)

We will now illustrate steps 3-4 using campaign 1.

  3. Calculate results
  • First using EDA: We will use scatter plots to look at bivariate relationships.
  • Second using regression: We will use a backward elimination (top-down) approach to create a model.
  4. Tell others about the results: We will illustrate how to predict future sales using the prediction data, and then create a plot of the predictions.

Calculate using EDA

Let’s first create a scatter plot between adspend and sales for the first campaign.

# Create a dataset only for the C1 campaign data
c1 <- df %>% filter(campaign == 'C1')

# Reshape the data from wide to long and then plot it
c1 %>%
  pivot_longer(cols = c(adSpendFacebook, adSpendInstagram)
               , names_to = 'ad', values_to = 'spend') %>%
  ggplot(aes(x = spend, y = sales, color = ad)) +
  geom_point()

This highlights the strong positive correlation between sales and adSpendFacebook for campaign 1.

Calculate using Regression

Let’s use regression with a backward elimination approach to create a model that we can use for predicting sales for campaign 1.

Backward elimination is when you start with all of the candidate variables in the model and then remove those that are insignificant.
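
As an aside, base R can automate a variant of this procedure with step(). Note that step() eliminates variables using AIC rather than p-values, so its final model may differ from the one we build by hand. A sketch, assuming the c1 data frame created above:

```r
# Automated backward elimination using AIC (base R's step function)
lm_full <- lm(sales ~ adSpendFacebook + adSpendInstagram + store, data = c1)
lm_back <- step(lm_full, direction = 'backward')
summary(lm_back)
```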

Model 1

lmc1_1 <- lm('sales ~ adSpendFacebook + adSpendInstagram + store', data = c1)
summary(lmc1_1)
## 
## Call:
## lm(formula = "sales ~ adSpendFacebook + adSpendInstagram + store", 
##     data = c1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -628.43 -109.58    6.38  116.52  539.06 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      787.3297    61.3859  12.826   <2e-16 ***
## adSpendFacebook   18.5124     0.6219  29.768   <2e-16 ***
## adSpendInstagram   0.5810     0.5852   0.993    0.323    
## storeB           -52.5726    59.4661  -0.884    0.378    
## storeC           -41.7864    58.7861  -0.711    0.478    
## storeD            31.1449    56.3893   0.552    0.582    
## storeE           -23.4717    53.3030  -0.440    0.660    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.4 on 135 degrees of freedom
## Multiple R-squared:  0.8707, Adjusted R-squared:  0.865 
## F-statistic: 151.5 on 6 and 135 DF,  p-value: < 2.2e-16

Observations

  • Notice that R automatically detected that store is a categorical variable and so it created dummy variables to deal with that. Nice!
  • The adjusted R-squared is high. This model would be good for making predictions.
  • Should we remove any variables?
    • The p-value on adSpendInstagram is insignificant. Thus, its coefficient is not reliably different from 0, and we should remove it from the model.
    • The p-values on the store coefficients are all insignificant. Thus, the store intercepts are not reliably different from the baseline (store A), and the variable should be removed from the model.
    • The final model that we present will be the one without the adSpendInstagram and store variables.

Insignificance

Let’s visualize the insignificance of the adSpendInstagram and store variables.

# Reshape the data from wide to long and then plot it
c1 %>%
  pivot_longer(cols = c(adSpendFacebook, adSpendInstagram)
               , names_to = 'Platform', values_to = 'Spend') %>%
  ggplot(aes(x = Spend, y = sales, color = Platform)) +
  geom_point() +
  stat_smooth(method = 'lm', se = F) +
  facet_wrap(vars(store), ncol = 2)

Observations
* The adSpendInstagram line is pretty flat in most of the plots. This means that sales stay about the same regardless of how much is spent on Instagram, which is why the variable is insignificant.
* The intercepts of the adSpendFacebook lines are all about the same across the store facets, with only a little variation in the adSpendInstagram intercepts. This is why the store variable is insignificant.

Back to model creation

Model 2

Let’s create a simpler regression model by only including adSpendFacebook as a predictor variable.

lmc1_2 <- lm('sales ~ adSpendFacebook', data = c1)
summary(lmc1_2)
## 
## Call:
## lm(formula = "sales ~ adSpendFacebook", data = c1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -585.83 -119.05   13.93  130.50  546.37 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     799.2518    39.9355   20.01   <2e-16 ***
## adSpendFacebook  18.5607     0.6142   30.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 202.6 on 140 degrees of freedom
## Multiple R-squared:  0.8671, Adjusted R-squared:  0.8661 
## F-statistic: 913.1 on 1 and 140 DF,  p-value: < 2.2e-16

Observations

  • Notice that the adjusted R-squared is still very high, so this simpler model could be reliably used for predicting what sales will be for campaign 1 based on how much you plan on spending on Facebook ads.
  • Specifically, we can see that each dollar spent on Facebook ads increases sales by about $18.56.
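
As a quick sanity check, we can reproduce a prediction by hand from the (rounded) coefficients in the summary above, and compare it with predict():

```r
# Predicted sales for $100 of Facebook ad spend, by hand
799.25 + 18.56 * 100 # about 2655

# The same prediction from the model object
predict(lmc1_2, data.frame(adSpendFacebook = 100))
```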

Predictions

Let’s make predictions with this model using the prediction data, and then visualize them in a 3d scatter plot using the plotly package.

# Read in the prediction data
pd <- readRDS('predictionData.rds') # Explore it by clicking on the pd dataframe in the Environment pane.

# Create a new column of predicted sales using the lmc1_2 model
pd$sales_pred <- predict(lmc1_2, pd)

# Create an interactive 3d scatter plot
plot_ly(data = pd
        , x = ~adSpendFacebook, y = ~adSpendInstagram, z = ~sales_pred
        , color = ~store
        , size = .5
        , type = 'scatter3d'
        , mode = 'markers'
)

Observations

  • If you orient the plot so that adSpendFacebook is in front, you can see that as you increase the amount spent on Facebook ads, predicted sales increase as well.
  • If you orient the plot so that adSpendInstagram is in front, you can see that there is no relationship with sales; the observations fall everywhere.

Campaign 2 (Forward Selection)

We will now repeat this process with campaign 2, but using forward selection this time.

Calculate Using EDA

Let’s first create a scatter plot between adspend and sales for campaign 2.

# Create a dataset only for the C2 campaign data
c2 <- df %>% filter(campaign == 'C2')

# Reshape the data from wide to long and then plot it
c2 %>%
  pivot_longer(cols = c(adSpendFacebook, adSpendInstagram)
               , names_to = 'Platform', values_to = 'Spend') %>%
  ggplot(aes(x = Spend, y = sales, color = Platform)) +
  geom_point()

It’s hard to tell if either variable has a strong relationship with sales.

Calculate Using Regression

I will use forward selection, which starts with a single predictor and adds variables one at a time, keeping those that are significant.

Model 1

lmc2_1 <- lm('sales ~ adSpendFacebook', data = c2)
summary(lmc2_1)
## 
## Call:
## lm(formula = "sales ~ adSpendFacebook", data = c2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1318.38  -437.49     7.31   352.52  1411.34 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1515.677     98.154  15.442  < 2e-16 ***
## adSpendFacebook    6.467      1.710   3.783 0.000229 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 606.1 on 140 degrees of freedom
## Multiple R-squared:  0.09273,    Adjusted R-squared:  0.08625 
## F-statistic: 14.31 on 1 and 140 DF,  p-value: 0.0002291

Observations
* The coefficient on adSpendFacebook is positive and significant, although the adjusted R-squared is low. I’ll keep that variable and add in another.
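
For reference, base R's add1() function can automate the choice of which variable to add next by comparing each candidate addition with an F-test. A sketch, assuming the lmc2_1 model fit above:

```r
# Compare candidate variables to add to the current model
add1(lmc2_1
     , scope = ~ adSpendFacebook + adSpendInstagram + store
     , test = 'F')
```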

Model 2

lmc2_2 <- lm('sales ~ adSpendFacebook + adSpendInstagram', data = c2)
summary(lmc2_2)
## 
## Call:
## lm(formula = "sales ~ adSpendFacebook + adSpendInstagram", data = c2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1016.46  -344.77   -72.54   389.57  1077.69 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       916.436     96.871   9.460  < 2e-16 ***
## adSpendFacebook     5.375      1.320   4.072 7.78e-05 ***
## adSpendInstagram   13.208      1.337   9.876  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 466.3 on 139 degrees of freedom
## Multiple R-squared:  0.4668, Adjusted R-squared:  0.4592 
## F-statistic: 60.86 on 2 and 139 DF,  p-value: < 2.2e-16

Observations
* The coefficients on both adSpendFacebook and adSpendInstagram are significant. I’ll keep those and add in store.

Model 3

lmc2_3 <- lm('sales ~ adSpendFacebook + adSpendInstagram + store', data = c2)
summary(lmc2_3)
## 
## Call:
## lm(formula = "sales ~ adSpendFacebook + adSpendInstagram + store", 
##     data = c2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -508.38 -123.83    3.79  126.90  517.26 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       735.0654    65.5871  11.207  < 2e-16 ***
## adSpendFacebook     5.2393     0.5703   9.188 6.32e-16 ***
## adSpendInstagram   14.0230     0.5810  24.135  < 2e-16 ***
## storeB           -431.0324    60.9491  -7.072 7.47e-11 ***
## storeC            523.3755    60.8766   8.597 1.80e-14 ***
## storeD            -84.1483    57.8160  -1.455    0.148    
## storeE            688.4019    59.9669  11.480  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 199.6 on 135 degrees of freedom
## Multiple R-squared:  0.9051, Adjusted R-squared:  0.9009 
## F-statistic: 214.6 on 6 and 135 DF,  p-value: < 2.2e-16

Observations

  • The adjusted R-squared is really high. This model would be good for making predictions.
  • Sales increase by about $5.24 for every dollar spent on advertisements on Facebook.
  • Sales increase by about $14.02 for every dollar spent on advertisements on Instagram.
  • The store variable is significant. Each store has an intercept that differs from store A’s, except for store D. We will leave the store variable in the model.
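
One way to back up these interpretations is with confidence intervals for the coefficients: intervals that exclude 0 correspond to the significant terms (storeD's should include 0), a sketch using base R's confint():

```r
# 95% confidence intervals for the lmc2_3 coefficients
round(confint(lmc2_3), 2)
```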

Predictions

Let’s make predictions with the lmc2_3 model using the prediction data, and then visualize them in a 3d scatter plot using the plotly package.

# Create a new column of predicted sales using the lmc2_3 model
pd$sales_pred_lmc2_3 <- predict(lmc2_3, pd)

# Create an interactive 3d scatter plot
plot_ly(data = pd
        , x = ~adSpendFacebook, y = ~adSpendInstagram, z = ~sales_pred_lmc2_3
        , color = ~store
        , size = .5
        , type = 'scatter3d'
        , mode = 'markers'
)

Observations

  • It’s easy to see that as you increase the amount spent on Instagram ads, predicted sales increase. Since each Instagram dollar returns more than a Facebook dollar ($14.02 vs. $5.24), this model suggests allocating advertising dollars to Instagram.
  • The intercepts for stores B, C, and E are much different than for store A.

Campaign 3

Repeat this process with campaign 3. Use either backward elimination or forward selection.
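
As a starting point, you could filter the data and re-create the scatter plot, following the same pattern as campaigns 1 and 2:

```r
# Create a dataset only for the C3 campaign data
c3 <- df %>% filter(campaign == 'C3')

# Reshape the data from wide to long and then plot it
c3 %>%
  pivot_longer(cols = c(adSpendFacebook, adSpendInstagram)
               , names_to = 'Platform', values_to = 'Spend') %>%
  ggplot(aes(x = Spend, y = sales, color = Platform)) +
  geom_point()
```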

Calculate with EDA

Calculate with Regression

Model 1

Predictions